Abstract: This paper presents new lower and upper bounds
for the optimal compression of binary prefix codes in terms of 
the most probable input symbol, where compression efficiency is 
determined by the nonlinear codeword length objective of mini- 
mizing maximum pointwise redundancy. This objective relates to 
both universal modeling and Shannon coding, and these bounds 
are tight throughout the interval. The upper bounds also apply 
to a related objective, that of $d$th exponential redundancy.

I. Introduction 

A lossless binary prefix coding problem takes a probability 
mass function p(i), defined for all i in the input alphabet X, 
and finds a binary code for X. Without loss of generality, 
we consider an n-item source emitting symbols drawn from 
the alphabet $X = \{1, 2, \ldots, n\}$, where $\{p(i)\}$ is the sequence
of probabilities for possible symbols ($p(i) > 0$ for $i \in X$
and $\sum_{i \in X} p(i) = 1$) in monotonically nonincreasing order
($p(i) \geq p(j)$ for $i < j$). The source symbols are coded into
binary codewords. The codeword $c(i) \in \{0,1\}^*$ in code $c$,
corresponding to input symbol $i$, has length $l(i)$, defining
length vector $l$.

The goal of the traditional coding problem is to find a prefix 
code minimizing expected codeword length $\sum_{i \in X} p(i) l(i)$, or,
equivalently, minimizing average redundancy

$$R(l,p) \triangleq \sum_{i \in X} p(i) l(i) - H(p) = \sum_{i \in X} p(i) \bigl( l(i) + \lg p(i) \bigr)$$

where $H(p) = -\sum_{i \in X} p(i) \lg p(i)$ is the Shannon entropy and
$\lg = \log_2$. A prefix code is a code for which no codeword begins
with a sequence that also comprises the whole of a second 
codeword. This problem is equivalent to finding a minimum-weight
external path

$$\sum_{i \in X} w(i) l(i)$$

among all rooted binary trees, due to the fact that every 
prefix code can be represented as a binary tree. In this tree 
representation, each edge from a parent node to a child node 
is labeled 0 (left) or 1 (right), with at most one of each type of
edge per parent node. A leaf is a node without children; this 
corresponds to a codeword, and the codeword is determined 
by the path from the root to the leaf. Thus, for example, a leaf 
that is the right-edge (1) child of a left-edge (0) child of a left- 
edge (0) child of the root will correspond to codeword 001. 
Leaf depth (distance from the root) is thus codeword length. 



The weights are the probabilities (i.e., $w(i) = p(i)$), and, in
fact, we will refer to the problem inputs as $\{w(i)\}$ for certain
generalizations in which their sum, $\sum_{i \in X} w(i)$, need not be 1.

If formulated in terms of $l$, the constraints on the minimization
are the integer constraint (i.e., that codes must be of
integer length) and the Kraft inequality [1]; that is, the set of
allowable codeword length vectors is

$$\mathcal{L}_n \triangleq \left\{ l \in \mathbb{Z}_+^n \text{ such that } \sum_{i=1}^{n} 2^{-l(i)} \leq 1 \right\}.$$

Drmota and Szpankowski [2] investigated a problem
which, instead of minimizing average redundancy $R(l,p) =
\sum_{i \in X} p(i)(l(i) + \lg p(i))$, minimizes maximum pointwise
redundancy

$$R^*(l,p) \triangleq \max_{i \in X} \bigl( l(i) + \lg p(i) \bigr).$$

Related to a universal modeling problem [3, p. 176], the idea 
here is that, given a symbol to be compressed, we wish the 
length of the compressed data to exceed self-information 
($-\lg p(i)$) by as little as possible, and thus consider the
worst case in this regard. This naturally relates to Shannon
coding, as a code with lengths $\lceil -\lg p(i) \rceil$ would never exceed
self-information by more than 1 bit. Any solution, then,
would necessarily have no codeword longer than its Shannon
code counterpart. Indeed, Drmota and Szpankowski used a
generalization of Shannon coding to solve the problem, which
satisfies

$$0 \leq R^*(l^{\mathrm{opt}}, p) < 1.$$
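As a minimal sketch of the Shannon-code observation above, the following Python computes the lengths $\lceil -\lg p(i) \rceil$ and the resulting maximum pointwise redundancy. The function names `shannon_lengths` and `max_pointwise_redundancy` are illustrative and not taken from the cited works, and floating-point logarithms are assumed to be accurate enough (probabilities extremely close to a power of two could round either way).

```python
import math

def shannon_lengths(p):
    """Shannon code lengths ceil(-lg p(i)) for a probability vector p."""
    return [math.ceil(-math.log2(x)) for x in p]

def max_pointwise_redundancy(lengths, p):
    """R*(l, p) = max_i (l(i) + lg p(i))."""
    return max(l + math.log2(x) for l, x in zip(lengths, p))

p = [0.5, 0.3, 0.2]
lengths = shannon_lengths(p)                  # [1, 2, 3]
r_star = max_pointwise_redundancy(lengths, p)
print(lengths, r_star)                        # r_star is about 0.678, below 1
```

For this input the shorter length vector (1, 2, 2) also satisfies the Kraft inequality and has a smaller maximum pointwise redundancy, which illustrates why the Shannon code is only an upper bound.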

We will improve the bounds, given $p(1)$, for minimum maximum
pointwise redundancy and discuss the related issue of the
length of the codeword for the most likely symbol in these
coding problems. These bounds are the first of their kind for
this objective, analogous to those for traditional Huffman
coding [4]-[9] and other nonlinear codes [10]-[12].

The bounds are derived using an alternative solution to this 
problem, a variation of Huffman coding [13] derived from that 
in [14]. In order to explain this variation, we first review the 
Huffman algorithm and some of the ways in which it can be 
modified. 

It is well known that the Huffman algorithm [15] finds a 
code minimizing average redundancy. The Huffman algorithm 
is a greedy algorithm built on the observation that the two 
least likely symbols will have the same length and can thus
be considered siblings in the coding tree. A reduction can thus
be made in which the two symbols with weights w(i) and w(j) 
can be considered as one with combined weight w(i) + w(j), 
and the codeword of the combined item determines all but the 
last bit of each of the items combined, which are differentiated 
by this last bit. This reduction continues until there is one item 
left, and, assigning this item the null string, a code is defined 
for all input symbols. In the corresponding optimal code tree, 
the $i$th leaf corresponds to the codeword of the $i$th input item,
and thus has weight $w(i)$, whereas the weights of parent nodes
are determined by the combined weight of the corresponding
merged item. Van Leeuwen gave an implementation of the
Huffman algorithm that runs in linear time
given sorted probabilities [16]. Shannon [17] had previously
shown that an optimal $l^{\mathrm{opt}}$ must satisfy

$$H(p) \leq \sum_{i \in X} p(i) l^{\mathrm{opt}}(i) < H(p) + 1, \quad \text{i.e.,} \quad 0 \leq R(l^{\mathrm{opt}}, p) < 1.$$
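The merging procedure described above can be sketched as follows. This is a simple heap-based illustration under our own naming (`huffman_lengths`, `average_redundancy`), not the linear-time implementation of [16]; it tracks the leaves under each merged item, so it favors clarity over speed.

```python
import heapq
import math

def huffman_lengths(p):
    """Codeword lengths of a Huffman code for probabilities p (len(p) >= 2)."""
    # Heap entries: (weight, tie_breaker, indices of the leaves under this item).
    heap = [(w, i, [i]) for i, w in enumerate(p)]
    heapq.heapify(heap)
    lengths = [0] * len(p)
    tie = len(p)
    while len(heap) > 1:
        w1, _, s1 = heapq.heappop(heap)      # the two least likely items
        w2, _, s2 = heapq.heappop(heap)
        for i in s1 + s2:                    # a merge adds one bit to each leaf below it
            lengths[i] += 1
        heapq.heappush(heap, (w1 + w2, tie, s1 + s2))
        tie += 1
    return lengths

def average_redundancy(lengths, p):
    """R(l, p) = sum_i p(i) * (l(i) + lg p(i))."""
    return sum(x * (l + math.log2(x)) for l, x in zip(lengths, p))

p = [0.5, 0.3, 0.2]
l_opt = huffman_lengths(p)                   # [1, 2, 2]
print(l_opt, average_redundancy(l_opt, p))   # the redundancy lies in [0, 1)
```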

Simple changes to the Huffman algorithm solve several re- 
lated coding problems which optimize for different objectives. 
Generalized versions of the Huffman algorithm have been 
considered by many authors [18]-[21]. These generalizations
change the combining rule; instead of replacing items i and j 
with an item of weight w(i) + w(j), the generalized algorithm 
replaces them with an item of weight f(w(i),w(j)) for some 
function $f$. Thus the weight of a combined item (a node)
no longer need be equal to the sum of the probabilities of 
the items merged to create it (the sum of the leaves of the 
corresponding subtree). This has the result that the sum of 
weights in a reduced problem need not be 1, unlike in the 
original Huffman algorithm. In particular, the weight of the 
root, $w_{\mathrm{root}}$, need not be 1. However, we continue to assume
that the sum of $p(\cdot)$, the inputs before reduction, will always
be 1. 

One such variation of the Huffman algorithm was used 
in Humblet's dissertation [22] for a queueing application 
(and further discussed in [18], [19], [23]). The problem this 
variation solves is as follows: Given probability mass function 
$p$ and $a > 1$, find a code minimizing

$$L_a(p,l) \triangleq \log_a \sum_{i \in X} p(i) a^{l(i)}. \qquad (1)$$

This growing exponential average problem is solved by using 
combining rule 

f(w(i),w(j)) = aw(i) +aw(j). (2) 

This problem was proposed (without solution) by Campbell 
[24], who later noted that this formulation can be extended to 
decaying exponential base $a \in (0, 1)$ [25]; Humblet noted that
the Huffman combining method (2) finds the optimal code for
(1) with $a \in (0, 1)$ as well [23].

Another variation, proposed in [26] and solved for in [19], 
can be called $d$th exponential redundancy [13], and is the
minimization of the following:

$$R_d(l,p) \triangleq \frac{1}{d} \lg \sum_{i \in X} p(i)^{1+d} 2^{d\,l(i)}.$$

Here we assume that $d > 0$, although $d \in (-1,0)$ is also a
valid problem. Clearly, this can be solved via reduction to (1)
by assigning $a = 2^d$ and using input weights $w(i) = p(i)^{1+d}$.
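The reduction can be made explicit with a short worked identity, using only the definitions of $L_a$ in (1) and of $R_d$ above. Taking $a = 2^d$ and $w(i) = p(i)^{1+d}$ (weights that need not sum to 1), and using $\frac{1}{d} \lg x = \log_{2^d} x$:

```latex
\begin{align*}
R_d(l,p) &= \frac{1}{d} \lg \sum_{i \in X} p(i)^{1+d}\, 2^{d\, l(i)} \\
         &= \log_{2^d} \sum_{i \in X} w(i)\, \bigl(2^d\bigr)^{l(i)} \\
         &= \log_a \sum_{i \in X} w(i)\, a^{l(i)} \;=\; L_a(w, l).
\end{align*}
```

Since $d > 0$ gives $a > 1$, a code minimizing $L_a(w,l)$ also minimizes $R_d(l,p)$, so combining rule (2) applies with these weights.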

Minimizing maximum redundancy is equivalent to minimizing
$d$th exponential redundancy for $d \to \infty$. This observation
leads to a Huffman-like solution with the combination rule 

f(w(i),w(j)) = 2max(w(i),w(j)) (3) 

as in [13]. 
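A minimal sketch of the Huffman-like procedure with combining rule (3) follows. The function name `minimax_redundancy_code` and the tie-breaking choice (a merged item sorts after existing items of equal weight) are our own illustrative assumptions; the sketch is heap-based rather than the linear-time variant mentioned in [13].

```python
import heapq
import math

def minimax_redundancy_code(p):
    """Huffman-like merging with f(w1, w2) = 2 * max(w1, w2).

    Returns (lengths, root_weight). Assumes len(p) >= 2.
    """
    heap = [(w, i, [i]) for i, w in enumerate(p)]
    heapq.heapify(heap)
    lengths = [0] * len(p)
    tie = len(p)                         # merged items sort after equal-weight older items
    while len(heap) > 1:
        w1, _, s1 = heapq.heappop(heap)  # two items of least weight
        w2, _, s2 = heapq.heappop(heap)
        for i in s1 + s2:
            lengths[i] += 1
        heapq.heappush(heap, (2 * max(w1, w2), tie, s1 + s2))
        tie += 1
    return lengths, heap[0][0]

p = [0.5, 0.3, 0.2]
lengths, w_root = minimax_redundancy_code(p)
r_star = max(l + math.log2(x) for l, x in zip(lengths, p))
print(lengths, w_root, r_star)           # lengths [1, 2, 2]; r_star is about 0.263
```

For this input the logarithm of the returned root weight coincides with the maximum pointwise redundancy of the resulting code.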

In the next section, we find tight exhaustive bounds for the
values of optimal $R^*(l,p)$ and corresponding $l(1)$ in terms
of $p(1)$, then find how we can extend these to exhaustive
(but not tight) bounds for optimal $R_d(l,p)$.

II. Bounds on the Redundancy Problems 

It is useful to come up with bounds on the performance of 
an optimal code, often in terms of the most probable symbol, 
$p(1)$. In minimizing average redundancy, such bounds are
often referred to as "redundancy bounds" because they are in
terms of this average redundancy, $R(l,p) = \sum_{i \in X} p(i) l(i) -
H(p)$. The simplest bounds for the optimal solution to the
minimum maximum pointwise redundancy problem

$$R^*_{\mathrm{opt}}(p) \triangleq \min_{l \in \mathcal{L}_n} \max_{i \in X} \bigl( l(i) + \lg p(i) \bigr)$$

can be combined with those for the average redundancy
problem:

$$0 \leq R_{\mathrm{opt}}(p) \leq R^*_{\mathrm{opt}}(p) < 1 \qquad (4)$$

where $R_{\mathrm{opt}}(p)$ is the average redundancy of the average
redundancy-optimal code. The average redundancy case is a
lower bound because the maximum ($R^*(l,p)$) of the values
($l(i) + \lg p(i)$) that average to a quantity ($R(l,p)$) can be no
less than the average (a fact that holds for all $l$ and $p$). The
upper bound is found similarly to the average redundancy case;
we can note that Shannon code $l^0_p(i) \triangleq \lceil -\lg p(i) \rceil$ results in
$R^*_{\mathrm{opt}}(p) \leq R^*(l^0_p, p) = \max_{i \in X} (\lceil -\lg p(i) \rceil + \lg p(i)) < 1$.

A few observations can be used to find a series of improved
lower and upper bounds on optimum maximum pointwise
redundancy based on (4):

Lemma 1: Suppose we apply (3) to find a Huffman-like
code tree in order to minimize maximum pointwise redundancy.
Then the following holds:

1) Items are always merged by nondecreasing weight.

2) The weight of the root $w_{\mathrm{root}}$ of the coding tree
determines the maximum pointwise redundancy, $R^*(l,p) =
\lg w_{\mathrm{root}}$.

3) The total probability of any subtree is no greater than
the total weight of the subtree.

4) If $p(1) \leq 2p(n-1)$, then a minimum maximum
pointwise redundancy code can be represented by a
complete tree, that is, a tree with leaves at depth $\lfloor \lg n \rfloor$
and $\lceil \lg n \rceil$ only (with $\sum_{i \in X} 2^{-l(i)} = 1$).

Proof: We use an inductive proof in which base cases 
of sizes 1 and 2 are trivial, and we use weights w, instead of 
probabilities p, to emphasize that the sums of weights need 
not necessarily add up to 1. Assume first that all properties
here are true for trees of size $n-1$ and smaller. We wish to
show that they are true for trees of size $n$.

The first property is true because $f(w(i),w(j)) =
2\max(w(i), w(j)) > w(i)$ for any $i$ and $j$; that is, a compound
item always has greater weight than either of the items 
combined to form it. Thus, after the first two weights are 
combined, all remaining weights, including the compound 
weight, are no less than either of the two original weights. 

Consider the second property; after merging the two least
weighted of $n$ (possibly merged) items, the property holds for
the resulting $n-1$ items. For the $n-2$ untouched items,
$l(i) + \lg w(i)$ remains the same. For the two merged items,
let $l(n-1)$ and $w(n-1)$ denote the maximum depth/weight
pair for item $n-1$ and $l(n)$ and $w(n)$ the pair for $n$. If $l'$ and
$w'$ denote the depth/weight pair of the combined item, then
$l' + \lg w' = l(n) - 1 + \lg(2\max(w(n-1), w(n))) = \max(l(n-1)
+ \lg w(n-1), l(n) + \lg w(n))$, so the two trees have identical
maximum redundancy, which is equal to $\lg w_{\mathrm{root}}$ since the root
node is of depth 0. Consider, for example, $p = (0.5, 0.3, 0.2)$,
which has optimal codewords with lengths $l = (1,2,2)$. The
first combined pair has $l' + \lg w' = 1 + \lg 0.6 = \max(2 +
\lg 0.3, 2 + \lg 0.2) = \max(l(2) + \lg p(2), l(3) + \lg p(3))$. This
value is identical to that of the maximum redundancy, $\lg 1.2 =
\lg w_{\mathrm{root}}$.

For the third property, the first combined pair yields a weight 
that is no less than the combined probabilities. Thus, via 
induction, the total probability of any (sub)tree is no greater 
than the weight of the (sub)tree. 

In order to show the final property, first note that
$\sum_{i \in X} 2^{-l(i)} = 1$ for any tree created using the Huffman-like
procedure, since all internal nodes have two children. Now
think of the procedure as starting with a queue of input items, 
ordered by nondecreasing weight from head to tail. After 
merging two items, obtained from the head of the queue, into 
one compound item, that item is placed back into the queue 
as one item, but not necessarily at the tail; an item is placed 
such that its weight is no smaller than any item ahead of it and 
is smaller than any item behind it. In keeping items ordered, 
this results in an optimal coding tree. A variant of this method 
can be used for linear-time coding [13]. 

In this case, we show not only that an optimal complete
tree exists, but that, given an $n$-item tree, all items that finish
at level $\lceil \lg n \rceil$ appear closer to the head of the queue than
any item at level $\lceil \lg n \rceil - 1$ (if any), using a similar approach
to the proof of Lemma 2 in [27]. Suppose this is true for
every case with $n-1$ items for $n > 2$, that is, that all nodes
are at levels $\lfloor \lg(n-1) \rfloor$ or $\lceil \lg(n-1) \rceil$, with the latter items
closer to the head of the queue than the former. Consider now
a case with $n$ nodes. The first step of coding is to merge
two nodes, resulting in a combined item that is placed at the
end of the combined-item queue, as we have asserted that
$p(1) \leq 2p(n-1) = 2\max(p(n-1), p(n))$. Because it is at
the end of the queue in the $n-1$ case, this combined node
is at level $\lfloor \lg(n-1) \rfloor$ in the final tree, and its children are at
level $1 + \lfloor \lg(n-1) \rfloor = \lceil \lg n \rceil$. If $n$ is a power of two, the
remaining items end up on level $\lg n = \lceil \lg(n-1) \rceil$, satisfying
this lemma. If $n-1$ is a power of two, they end up on level
$\lg(n-1) = \lfloor \lg n \rfloor$, also satisfying the lemma. Otherwise, there
is at least one item ending up at level $\lceil \lg n \rceil = \lceil \lg(n-1) \rceil$
near the head of the queue, followed by the remaining items,
which end up at level $\lfloor \lg n \rfloor = \lfloor \lg(n-1) \rfloor$. In any case, all
properties of the lemma are satisfied for $n$ items, and thus for
any number of items. ■
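The fourth property of Lemma 1 can be checked on a small case by exhaustive search. The sketch below is a brute-force illustration, not the method of the proof: it enumerates every length vector satisfying the Kraft inequality and keeps those minimizing the maximum pointwise redundancy. The function name `optimal_length_vectors`, the example distribution, and the cap `max_len=4` (longer codewords cannot be optimal for this input, since they would push the redundancy past 1) are our own assumptions.

```python
import itertools
import math
from fractions import Fraction

def optimal_length_vectors(p, max_len):
    """Exhaustively find all length vectors minimizing max_i (l(i) + lg p(i))."""
    best, argbest = math.inf, []
    for l in itertools.product(range(1, max_len + 1), repeat=len(p)):
        if sum(Fraction(1, 2 ** k) for k in l) > 1:   # violates the Kraft inequality
            continue
        r = max(k + math.log2(x) for k, x in zip(l, p))
        if r < best - 1e-12:
            best, argbest = r, [l]
        elif abs(r - best) <= 1e-12:
            argbest.append(l)
    return best, argbest

p = [0.3, 0.2, 0.2, 0.15, 0.15]          # here p(1) = 0.3 and 2 p(n-1) = 0.3
n = len(p)
best, codes = optimal_length_vectors(p, max_len=4)
lo, hi = math.floor(math.log2(n)), math.ceil(math.log2(n))
complete = [l for l in codes
            if set(l) <= {lo, hi} and sum(Fraction(1, 2 ** k) for k in l) == 1]
print(best, codes, complete)
```

In this example the search returns a single optimal length vector, and it is a complete tree with leaves only at depths 2 and 3.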

We can now present the improved redundancy bounds. 

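Theorem 1: For any distribution in which $p(1) \geq 2/3$,
$R^*_{\mathrm{opt}}(p) = 1 + \lg p(1)$. If $p(1) \in [0.5, 2/3)$, then $R^*_{\mathrm{opt}}(p) \in
[1 + \lg p(1), 2 + \lg(1 - p(1)))$ and these bounds are tight. Define
$\lambda \triangleq \lceil -\lg p(1) \rceil$, which, for $p(1) \in (0, 0.5)$, is greater than 1.
For this range the following bounds for $R^*_{\mathrm{opt}}(p)$ are tight:

$$
\begin{array}{c|c}
p(1) & R^*_{\mathrm{opt}}(p) \\ \hline
\left[ \dfrac{1}{2^\lambda}, \dfrac{1}{2^\lambda - 1} \right) &
\left[ \lambda + \lg p(1),\ 1 + \lg \dfrac{1 - p(1)}{1 - 2^{-\lambda}} \right) \\
\left[ \dfrac{1}{2^\lambda - 1}, \dfrac{2}{2^\lambda + 1} \right) &
\left[ \lg \dfrac{1 - p(1)}{1 - 2^{-\lambda+1}},\ 1 + \lg \dfrac{1 - p(1)}{1 - 2^{-\lambda}} \right) \\
\left[ \dfrac{2}{2^\lambda + 1}, \dfrac{1}{2^{\lambda-1}} \right) &
\left[ \lg \dfrac{1 - p(1)}{1 - 2^{-\lambda+1}},\ \lambda + \lg p(1) \right]
\end{array}
$$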
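The bounds of Theorem 1 can be evaluated numerically. The following sketch returns the lower bound, the upper bound, and a flag indicating whether the upper bound is attained (rather than only approached). The name `theorem1_bounds` and the return convention are our own; floating-point comparisons at the interval endpoints are assumed to behave as intended.

```python
import math

def theorem1_bounds(p1):
    """Return (lower, upper, upper_is_attained) for R*_opt given p(1) = p1."""
    if not 0 < p1 < 1:
        raise ValueError("p1 must lie strictly between 0 and 1")
    if p1 >= 2 / 3:
        value = 1 + math.log2(p1)                 # fully determined
        return value, value, True
    if p1 >= 0.5:
        return 1 + math.log2(p1), 2 + math.log2(1 - p1), False
    lam = math.ceil(-math.log2(p1))               # lambda > 1 on (0, 0.5)
    low_a = lam + math.log2(p1)
    low_b = math.log2((1 - p1) / (1 - 2.0 ** (1 - lam)))
    up_open = 1 + math.log2((1 - p1) / (1 - 2.0 ** -lam))
    if p1 < 1 / (2 ** lam - 1):                   # first row of the table
        return low_a, up_open, False
    if p1 < 2 / (2 ** lam + 1):                   # second row
        return low_b, up_open, False
    return low_b, low_a, True                     # third row

for p1 in (0.25, 0.3, 0.45, 0.6, 0.7):
    print(p1, theorem1_bounds(p1))
```

At $p(1) = 0.25$ the sketch returns the interval $[0, 1)$, the same as the simple bounds.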



Proof: The key here is generalizing the simple bounds
of (4).

Upper bound: Let us define what we call a first-order
Shannon code:

$$l^1_p(i) \triangleq \begin{cases}
\lambda = \lceil -\lg p(1) \rceil, & i = 1 \\
\left\lceil -\lg \left( p(i)\, \dfrac{1 - 2^{-\lambda}}{1 - p(1)} \right) \right\rceil, & i \in \{2, 3, \ldots, n\}
\end{cases}$$

This code, previously presented in the context of finding average
redundancy bounds given any probability [28], improves
upon the original "zero-order" Shannon code $l^0_p$ by taking the
length of the first codeword into account when designing the
rest of the code. The code satisfies the Kraft inequality, and
thus, as a valid code, its redundancy is an upper bound on the
redundancy of an optimal code. Note that

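$$\max_{i>1} \bigl( l^1_p(i) + \lg p(i) \bigr)
= \max_{i>1} \left( \left\lceil \lg \frac{1 - p(1)}{p(i)(1 - 2^{-\lambda})} \right\rceil + \lg p(i) \right)
< 1 + \lg \frac{1 - p(1)}{1 - 2^{-\lambda}}.$$

If $p(1) \in [2/(2^\lambda + 1), 1/2^{\lambda-1})$, the maximum pointwise
redundancy of the first item is no less than $1 + \lg((1 - p(1))/(1 -
2^{-\lambda}))$, and thus $R^*_{\mathrm{opt}}(p) \leq R^*(l^1_p, p) = \lambda + \lg p(1)$. Otherwise,
$R^*_{\mathrm{opt}}(p) \leq R^*(l^1_p, p) < 1 + \lg((1 - p(1))/(1 - 2^{-\lambda}))$.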
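The first-order Shannon code can be sketched in a few lines. The function name `first_order_shannon_lengths` and the example distribution are our own illustrative choices, and the input is assumed to be sorted in nonincreasing order with $p(1) < 1$; the Kraft sum is computed exactly with `fractions.Fraction`.

```python
import math
from fractions import Fraction

def first_order_shannon_lengths(p):
    """Lengths l^1_p: lambda for item 1, rescaled Shannon lengths for the rest."""
    p1 = p[0]
    lam = math.ceil(-math.log2(p1))
    scale = (1 - 2.0 ** -lam) / (1 - p1)
    return [lam] + [math.ceil(-math.log2(x * scale)) for x in p[1:]]

p = [0.45, 0.25, 0.2, 0.1]
l1 = first_order_shannon_lengths(p)                  # [2, 2, 2, 3]
kraft = sum(Fraction(1, 2 ** k) for k in l1)         # 7/8, so the code is valid
lam = l1[0]
others = max(k + math.log2(x) for k, x in zip(l1[1:], p[1:]))
bound = 1 + math.log2((1 - p[0]) / (1 - 2.0 ** -lam))
print(l1, kraft, others, bound)                      # others stays below bound
```

Here $p(1) = 0.45$ falls in $[2/(2^\lambda + 1), 1/2^{\lambda-1})$ with $\lambda = 2$, so the first item carries the maximum pointwise redundancy, $\lambda + \lg p(1)$.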
The tightness of the upper bound in $[0.5, 1)$ is shown via

$$p = (p(1),\ 1 - p(1) - \epsilon,\ \epsilon)$$

for which the bound is achieved in $[2/3, 1)$ for any $\epsilon \in (0, (1 -
p(1))/2]$ and approached in $[0.5, 2/3)$ as $\epsilon \downarrow 0$. If $\lambda > 1$ and
$p(1) \in [2/(2^\lambda + 1), 1/2^{\lambda-1})$, use probability mass function

$$p = \Bigl( p(1),\ \underbrace{\frac{1 - p(1) - \epsilon}{2^\lambda - 2}, \ldots, \frac{1 - p(1) - \epsilon}{2^\lambda - 2}}_{2^\lambda - 2},\ \epsilon \Bigr)$$

where $\epsilon \in (0, 1 - p(1) 2^{\lambda-1})$.

Because $p(1) \geq 2/(2^\lambda + 1)$, $1 - p(1) 2^{\lambda-1} \leq (1 - p(1) -
\epsilon)/(2^\lambda - 2)$, and $p(n-1) \geq p(n)$. Similarly, $p(1) < 1/2^{\lambda-1}$
assures that $p(1) \geq p(2)$, so the probability mass function
is monotonic. Since $2p(n-1) > p(1)$, by Lemma 1 an
optimal code for this probability mass function is $l(i) = \lambda$
for all $i$, achieving $R^*(l,p) = \lambda + \lg p(1)$, with item 1 having
the maximum pointwise redundancy.

This leaves only $p(1) \in [1/2^\lambda, 2/(2^\lambda + 1))$, for which we
consider

$$p = \Bigl( p(1),\ \underbrace{\frac{1 - p(1) - \epsilon}{2^\lambda - 1}, \ldots, \frac{1 - p(1) - \epsilon}{2^\lambda - 1}}_{2^\lambda - 1},\ \epsilon \Bigr)$$

where $\epsilon \downarrow 0$. This is a monotonic probability mass function for
sufficiently small $\epsilon$, for which we also have $p(1) < 2p(n-1)$,
so (again from Lemma 1) this results in an optimal code where
$l(i) = \lambda$ for $i \in \{1, 2, \ldots, n-2\}$ and $l(n-1) = l(n) = \lambda + 1$,
and thus the bound is approached with item $n-1$ having the
maximum pointwise redundancy.
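How the last family approaches the upper bound can be seen numerically. The sketch below builds the probability mass function with $2^\lambda - 1$ equal middle items, uses the lengths stated in the text (it does not rerun the coding algorithm), and reports the gap between the bound $1 + \lg((1 - p(1))/(1 - 2^{-\lambda}))$ and the resulting maximum pointwise redundancy. The name `approach_gap` is hypothetical, and $p(1)$ is assumed to lie in $[1/2^\lambda, 2/(2^\lambda + 1))$.

```python
import math

def approach_gap(p1, eps):
    """Upper bound minus R*(l, p) for the family with 2**lam - 1 equal middle items."""
    lam = math.ceil(-math.log2(p1))
    m = 2 ** lam - 1                                  # number of equal middle items
    p = [p1] + [(1 - p1 - eps) / m] * m + [eps]       # n = 2**lam + 1 items
    lengths = [lam] * m + [lam + 1] * 2               # items n-1 and n get lam + 1
    r_star = max(l + math.log2(x) for l, x in zip(lengths, p))
    bound = 1 + math.log2((1 - p1) / (1 - 2.0 ** -lam))
    return bound - r_star

for eps in (1e-2, 1e-4, 1e-6):
    print(eps, approach_gap(0.3, eps))                # positive and shrinking toward 0
```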

Lower bound: Consider all optimal codes with $l(1) = \mu$
for some fixed $\mu \in \{1, 2, \ldots\}$. If $p(1) \geq 2^{-\mu}$, $R^*(l,p) \geq
l(1) + \lg p(1) = \mu + \lg p(1)$. If $p(1) < 2^{-\mu}$, consider the
weights at level $\mu$ (i.e., $\mu$ edges below the root). One of these
weights is $p(1)$, while the rest are known to sum to a number
no less than $1 - p(1)$. Thus at least one weight must be at least
$(1 - p(1))/(2^\mu - 1)$ and $R^*(l,p) \geq \mu + \lg((1 - p(1))/(2^\mu - 1))$.
Thus,

$$R^*_{\mathrm{opt}}(p) \geq \mu + \lg \max\left( p(1), \frac{1 - p(1)}{2^\mu - 1} \right)$$

for $l(1) = \mu$, and, since $\mu$ can be any positive integer,

$$R^*_{\mathrm{opt}}(p) \geq \min_{\mu \in \{1, 2, 3, \ldots\}} \left( \mu + \lg \max\left( p(1), \frac{1 - p(1)}{2^\mu - 1} \right) \right)$$

which is equivalent to the bounds provided.
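The minimization over $\mu$ is easy to evaluate directly. In the sketch below, the name `lower_bound` and the cutoff `mu_max` are our own assumptions; truncating the search is harmless because, once $2^{-\mu} \leq p(1)$, the term equals $\mu + \lg p(1)$ and only grows with $\mu$.

```python
import math

def lower_bound(p1, mu_max=60):
    """min over mu = 1..mu_max of mu + lg max(p1, (1 - p1) / (2**mu - 1))."""
    return min(
        mu + math.log2(max(p1, (1 - p1) / (2 ** mu - 1)))
        for mu in range(1, mu_max + 1)
    )

print(lower_bound(0.3))    # equals 2 + lg 0.3, about 0.263
print(lower_bound(0.45))   # equals lg(0.55 / 0.5), about 0.138
print(lower_bound(0.7))    # equals 1 + lg 0.7, about 0.485
```

These values agree with the closed-form lower bounds of Theorem 1 for the corresponding ranges of $p(1)$.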

For $p(1) \in [1/(2^{\mu+1} - 1), 1/2^\mu)$ for some $\mu$, consider

$$p = \Bigl( p(1),\ \underbrace{\frac{1 - p(1)}{2^{\mu+1} - 2}, \ldots, \frac{1 - p(1)}{2^{\mu+1} - 2}}_{2^{\mu+1} - 2} \Bigr).$$

By Lemma 1, this will have a complete coding tree and thus
achieve the lower bound for this range ($\lambda = \mu + 1$). Similarly

$$p = \Bigl( p(1),\ \underbrace{2^{-\mu}, \ldots, 2^{-\mu}}_{2^\mu - 2},\ 2^{-\mu+1} - p(1) \Bigr)$$

has a fixed-length optimal coding tree for $p(1) \in
[1/2^\mu, 1/(2^\mu - 1))$, achieving the lower bound for this range
($\lambda = \mu$). ■



[Figure 1; horizontal axis: $p(1)$]

Fig. 1. Tight bounds on minimum maximum pointwise redundancy, including
achievable upper bounds (solid), approachable upper bounds (dashed),
achievable lower bounds (dotted), and fully determined values for $p(1) \geq 2/3$
(dot-dashed).



Note that the bounds of (4) are identical to the tight bounds
when $p(1)$ is a negative integer power of two. In addition, the
tight bounds clearly approach 0 and 1 as $p(1) \downarrow 0$. This behavior
is in stark contrast with average redundancy, for which bounds
get closer, not further apart, due to Gallager's redundancy
bound [4], $R_{\mathrm{opt}}(p) \leq p(1) + 0.086$, which cannot be significantly
improved for small $p(1)$ [9]. Moreover, as $p(1)$ approaches 1, the
upper and lower bounds on minimum average redundancy coding
converge but never merge, whereas the minimum maximum
redundancy bounds are identical for $p(1) \geq 2/3$.

In addition to finding redundancy bounds in terms of $p(1)$,
it is also often useful to find bounds on the behavior of $l(1)$
in terms of $p(1)$, as was done for optimal average redundancy
in [29].

Theorem 2: Any optimal code for probability mass function
$p$, where $p(1) \geq 2^{-\nu}$, must have $l(1) \leq \nu$. This bound is
tight, in the sense that, for $p(1) < 2^{-\nu}$, one can always find a
probability mass function with $l(1) > \nu$. Conversely, if $p(1) \leq
1/(2^\nu - 1)$, there is an optimal code with $l(1) \geq \nu$, and this
bound is also tight.

Proof: Suppose $p(1) \geq 2^{-\nu}$ and $l(1) \geq 1 + \nu$. Then
$R^*_{\mathrm{opt}}(p) = R^*(l,p) \geq l(1) + \lg p(1) \geq 1$, contradicting the
simple bounds of (4). Thus $l(1) \leq \nu$.

For tightness of the bound, suppose $p(1) \in (2^{-\nu-1}, 2^{-\nu})$
and consider $n = 2^{\nu+1}$ and

$$p = \Bigl( p(1),\ \underbrace{2^{-\nu-1}, \ldots, 2^{-\nu-1}}_{n-2},\ 2^{-\nu} - p(1) \Bigr).$$

If $l(1) \leq \nu$, then, by the Kraft inequality, one of $l(2)$ through
$l(n-1)$ must exceed $\nu + 1$. However, this contradicts the simple
bounds of (4). For $p(1) = 2^{-\nu-1}$, a uniform distribution
results in $l(1) = \nu + 1$. Thus, since these two results hold
for any $\nu$, this extends to all $p(1) < 2^{-\nu}$, and this bound is
tight.

Suppose $p(1) \leq 1/(2^\nu - 1)$ and consider an optimal length
distribution with $l(1) < \nu$. Consider the weights of the nodes
of the corresponding code tree at level $l(1)$. One of these
weights is $p(1)$, while the rest are known to sum to a number
no less than $1 - p(1)$. Thus there is one node of at least weight

$$\frac{1 - p(1)}{2^{l(1)} - 1} \geq \frac{1 - p(1)}{2^{l(1)} - 2^{l(1)+1-\nu}}$$

and thus, taking the logarithm and adding $l(1)$ to the right-hand
side,

$$R^*(l,p) \geq \nu - 1 + \lg \frac{1 - p(1)}{2^{\nu-1} - 1}.$$

Note that $l(1) + 1 + \lg p(1) \leq \nu + \lg p(1) \leq \nu - 1 + \lg((1 -
p(1))/(2^{\nu-1} - 1))$, a direct consequence of $p(1) \leq 1/(2^\nu - 1)$.
Thus, if we replace this code with one for which $l(1) = \nu$,
the code is still optimal. The tightness of the bound is easily
seen by applying Lemma 1 to distributions of the form

$$p = \Bigl( p(1),\ \underbrace{\frac{1 - p(1)}{2^\nu - 2}, \ldots, \frac{1 - p(1)}{2^\nu - 2}}_{2^\nu - 2} \Bigr)$$

for $p(1) \in (1/(2^\nu - 1), 1/2^{\nu-1})$. This results in $l(1) = \nu - 1$
and thus $R^*_{\mathrm{opt}}(p) = \nu + \lg(1 - p(1)) - \lg(2^\nu - 2)$, which no
code with $l(1) > \nu - 1$ could achieve. ■

In particular, if $p(1) \geq 0.5$, $l(1) = 1$, while if $p(1) \leq 1/3$,
there is an optimal code with $l(1) > 1$.
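Theorem 2 translates into a simple range computation for $l(1)$. In the sketch below, the function name `l1_range` and its return convention are our own: every optimal code has $l(1)$ no greater than the second value, and some optimal code has $l(1)$ no less than the first. Floating-point logarithms are assumed; values of $p(1)$ lying exactly on a threshold such as $1/3$ may round either way.

```python
import math

def l1_range(p1):
    """Return (nu_low, nu_high) implied by Theorem 2 for p(1) = p1 in (0, 1)."""
    nu_high = math.ceil(-math.log2(p1))          # smallest nu with p1 >= 2**-nu
    nu_low = math.floor(math.log2(1 + 1 / p1))   # largest nu with p1 <= 1/(2**nu - 1)
    return nu_low, nu_high

for p1 in (0.6, 0.4, 0.3, 0.1):
    print(p1, l1_range(p1))
```

For $p(1) = 0.3$ both values equal 2, so some optimal code assigns the most probable symbol a codeword of exactly two bits and no optimal code assigns it a longer one.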

We now briefly address the $d$th exponential redundancy
problem. Recall that this is the minimization of

$$R_d(l,p) \triangleq \frac{1}{d} \lg \sum_{i \in X} p(i)^{1+d} 2^{d\,l(i)}.$$

This can be rewritten as

$$R_d(l,p) = \frac{1}{d} \lg \sum_{i \in X} p(i)\, 2^{d(l(i) + \lg p(i))}.$$
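The two forms can be compared numerically. The sketch below evaluates both expressions for several values of $d$ on a small example, alongside the average and maximum pointwise redundancy of the same code. The function names `r_d` and `r_d_rewritten`, the example code, and the chosen values of $d$ are our own illustrative assumptions.

```python
import math

def r_d(lengths, p, d):
    """d-th exponential redundancy in its original form."""
    total = sum(x ** (1 + d) * 2 ** (d * l) for l, x in zip(lengths, p))
    return math.log2(total) / d

def r_d_rewritten(lengths, p, d):
    """The same quantity written with pointwise redundancies l(i) + lg p(i)."""
    total = sum(x * 2 ** (d * (l + math.log2(x))) for l, x in zip(lengths, p))
    return math.log2(total) / d

p, lengths = [0.5, 0.3, 0.2], [1, 2, 2]
average = sum(x * (l + math.log2(x)) for l, x in zip(lengths, p))
maximum = max(l + math.log2(x) for l, x in zip(lengths, p))
values = [r_d(lengths, p, d) for d in (0.5, 1, 4, 32)]
print(average, values, maximum)
```

In this example the printed values are nondecreasing in $d$ and lie between the average redundancy and the maximum pointwise redundancy of the code.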



A straightforward application of Lyapunov's inequality for
moments yields $R_c(l,p) \leq R_d(l,p)$ for $c \leq d$, which, taking
limits to 0 and $\infty$, results in

$$0 \leq R(l,p) \leq R_d(l,p) \leq R^*(l,p)$$

for any valid $p$, $d > 0$, and $l$, resulting in an extension of (4),

$$0 \leq R_{\mathrm{opt}}(p) \leq R^d_{\mathrm{opt}}(p) \leq R^*_{\mathrm{opt}}(p) < 1$$

where $R^d_{\mathrm{opt}}(p)$ is the optimal $d$th exponential redundancy, an
improvement on the bounds found in [13]. This implies that
this problem can be bounded in terms of the most likely
symbol using the upper bounds of Theorem 1 and the lower
bounds of average redundancy (Huffman) coding [7]:

$$R_{\mathrm{opt}}(p) \geq \xi - (1 - p(1)) \lg(2^\xi - 1) - H(p(1), 1 - p(1))$$

where

$$\xi = \left\lceil \lg \frac{1 - 2^{1/(p(1) - 1)}}{1 - 2^{p(1)/(p(1) - 1)}} \right\rceil$$